ABSTRACT

While the literature on scene understanding and semantic labeling is fairly dense, few works use motion cues to improve semantic performance and vice versa. In this paper, we address the problem of semantic motion segmentation and show how semantic and motion priors augment performance. We propose an algorithm that jointly infers the semantic class and motion labels of an object. Integrating semantic, geometric and optical-flow-based constraints into a dense CRF model, we infer both the object class and the motion class for each pixel. We found an improvement in performance using a fully connected CRF as compared to a standard clique-based CRF. For inference, we use an algorithm based on the mean-field approximation. Our method outperforms recently proposed motion detection algorithms and also improves the semantic labeling compared to the state-of-the-art Automatic Labeling Environment algorithm on the challenging KITTI dataset, especially for object classes such as pedestrians and cars that are critical to an outdoor robotic navigation scenario.

1. INTRODUCTION 

Using object class and motion cues jointly provides an enhanced understanding of the scene. We perceive a scene better when we describe it in terms of a moving or stationary car (or pedestrian) than in terms of the presence of only a few object classes. Motivated by this, we jointly formulate the problems of object class segmentation, which assigns an object label such as road or building to every pixel in the image, and motion segmentation, in which every pixel in the image is labelled as moving or stationary. Semantic motion segmentation has applications in robotics, where an autonomous system is in a better position to plan its path based on this joint knowledge.

In this work, we propose a method to model the whole image scene using a fully connected multi-label conditional random field (CRF) with joint learning and inference. To justify that these problems need to be solved jointly, we relate the two layers: a correct labelling of the object class helps infer the motion labels of the corresponding pixels, and motion in the image improves the inference of the object labelling. To provide some intuition, note that object class boundaries are more likely to occur at a sudden transition in motion, and vice versa. Moreover, the class of the object provides a very important clue for motion analysis. For example, in a scene we can assume that the probability of a car or person moving is greater than the probability of a moving wall or a moving road.

We use sequential stereo pairs from three time instants to label the scene and estimate the motion, which demonstrates the robustness of our motion segmentation method. The interaction between the semantic labelling and the motion likelihood is learnt, which helps us to efficiently segment distant moving objects within a few time instants. Each image pixel is labelled with both an object class and a motion estimate. Various approximate inference methods exist, such as maximum a posteriori methods (e.g. graph cuts) or variational methods such as the mean-field approximation, which allow us to approximately estimate a maximum posterior marginals (MPM) solution. We use a mean-field-based inference algorithm because it enables efficient approximations for high-dimensional filtering, which reduce the complexity of message passing from quadratic to linear, resulting in inference that is linear in the number of variables and sub-linear in the number of edges.

Herein we show that the joint labeling formulation is mutually beneficial for motion as well as semantic labeling. Our method is similar to earlier work that used a multi-layer multi-label CRF for the joint estimation of scene reconstruction and of attributes, respectively. Specifically, we show a significant performance gain for motion labeling on the challenging KITTI street dataset in comparison to state-of-the-art methods in motion segmentation. Concurrently, we also improve on the performance of ALE, with segmentation results closer to the ground truth, especially for pedestrians and cars. We accomplish this using motion likelihood estimates and incorporate semantics to obtain a better holistic understanding of the dynamic scene. These results show that this approach closely mimics human perception, where semantic labels play an important role in identifying motion.

2. RELATED WORK 

There is a fairly large amount of literature on both semantic and motion segmentation. For semantic image segmentation, existing approaches use TextonBoost, in which weakly predictive features of an image, such as color, location and texton features, are passed through a classifier to give the cost of a label for a particular pixel. These costs are combined in a contrast-sensitive conditional random field. Most mid-level inference methods do not use pixels directly, but segment the image into regions. State-of-the-art results for dense semantic image segmentation have been shown using a superpixel-based hierarchical framework. Recently, much scene understanding research has gone into using additional cues to obtain a better segmentation. Ladicky et al. have used object detectors, co-occurrence statistics and stereo disparity to improve the semantics. Yao et al. have combined semantic segmentation, object detection and scene classification to understand a scene as a whole.

Figure 1: Illustration of the proposed method. The system takes a sequence of rectified stereo images from the tracking dataset of KITTI (A). Our formulation computes the object class probabilities (B) and the motion likelihood (E) using the disparity map (D) and optical flow (C). These are input into a joint formulation which exploits the object class and motion co-dependencies by allowing them to interact (F). The inference is computed using the mean-field approximation method to give a joint label to each pixel (G). Best viewed in color.

Motion segmentation has mostly been approached using geometric priors from video. The general paradigm involves using geometric constraints or reducing the model to affine in order to cluster the trajectories into subspaces. These methods have been shown not to work in complex environments where the moving cars lie in the same subspace. We consider the deviation of the trajectories based on the 3D motion of the camera estimated from the trajectories, which provides a motion likelihood even in challenging scenarios. Semantics for motion detection is not new: Wedel et al. have used scene flow to segment motion with a stereo camera. There have been numerous works on segmenting moving objects by compensating for the platform movement. Recently, motion features have been learnt using deep learning to give a better motion likelihood estimate. Understanding motion semantics is an emerging area and can be used to understand and model traffic patterns [4].

The structure of the paper is as follows. Section 3 formulates the CRFs for dense image labelling and describes how they can be applied to the problems of object class segmentation and motion segmentation. Section 4 describes the joint formulation allowing for the joint optimization of these problems, while Section 5 describes the mean-field inference for the joint optimization. The learning of the class and motion correlation is described in Section 6. We evaluate and compare our algorithm in Section 7.

3. MULTI-LABEL CRF FORMULATION 

Our joint optimisation consists of two parts, object class segmentation and motion segmentation. We first introduce the terms used in the paper. We define a dense CRF where the set of random variables $Z = \{Z_1, Z_2, \ldots, Z_N\}$ corresponds to the set of all image pixels $i \in \mathcal{V} = \{1, 2, \ldots, N\}$. Let $\mathcal{N}$ be the neighbourhood system of the random field defined by the sets $\mathcal{N}_i, \forall i \in \mathcal{V}$, where $\mathcal{N}_i$ denotes the neighbours of the variable $Z_i$. Any possible assignment of labels to the random variables will be called a labelling and denoted by $z$.


3.1 Dense Multi-class CRF 

We formulate the problem of object class segmentation as finding a minimal cost labelling of a CRF defined over a set of random variables $X = \{X_1, X_2, \ldots, X_N\}$, each taking a state from the label space $\mathcal{L} = \{l_1, l_2, \ldots, l_k\}$, where $k$ is the number of object class labels. Each label $l$ indicates a different object class such as car, road, building or sky. The energy is:

$$E^{O}(x) = \sum_{i \in \mathcal{V}} \psi_i^{O}(x_i) + \sum_{i \in \mathcal{V},\, j \in \mathcal{N}_i} \psi_{ij}^{O}(x_i, x_j) \qquad (1)$$

The unary potential $\psi_i^{O}(x_i)$ describes the cost of the pixel taking the corresponding label. The pairwise potential encourages similar pixels to have the same label. The unary potential is computed for each pixel using pre-trained models of the color, texture and location features for each object. In a typical graph topology, we consider a 4- or 8-neighbour connected network. With the mean-field inference algorithm it is possible to use a fully connected graph, where all the pixels in the image are interconnected, given certain forms of pairwise potential. Therefore, the pairwise potential takes the form of a Potts model:

For a fully connected graph topology, $p(i,j)$ is given as:

where $p_i$ indicates the location of the $i$th pixel, $I_i$ indicates the intensity of the $i$th pixel, and $\theta_\beta$, $\theta_p$, $\theta_v$ are the model parameters learned from the training data.
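As an illustration of the fully connected pairwise term, the sketch below evaluates a Potts cost weighted by a two-Gaussian kernel over pixel position and intensity. It is not the exact kernel of this paper: the form follows the commonly used dense-CRF kernel (an appearance term plus a smoothness term), and the function names, kernel weights and bandwidth values are illustrative assumptions.

```python
import math

def dense_kernel(p_i, p_j, I_i, I_j,
                 theta_beta=30.0, theta_v=10.0, theta_p=3.0,
                 w_appearance=1.0, w_smoothness=1.0):
    """Two-Gaussian kernel p(i, j) over position p and intensity I (assumed form)."""
    d_pos = sum((a - b) ** 2 for a, b in zip(p_i, p_j))
    d_int = sum((a - b) ** 2 for a, b in zip(I_i, I_j))
    # Appearance term: nearby pixels with similar intensity interact strongly.
    appearance = math.exp(-d_pos / (2 * theta_beta ** 2) - d_int / (2 * theta_v ** 2))
    # Smoothness term: depends on spatial proximity only.
    smoothness = math.exp(-d_pos / (2 * theta_p ** 2))
    return w_appearance * appearance + w_smoothness * smoothness

def potts_pairwise(x_i, x_j, p_i, p_j, I_i, I_j):
    """Potts model: zero cost for equal labels, kernel-weighted cost otherwise."""
    if x_i == x_j:
        return 0.0
    return dense_kernel(p_i, p_j, I_i, I_j)
```

With this form, two differently labelled pixels pay a high cost when they are close and look alike, and almost nothing when they are far apart or differ strongly in intensity.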


3.2 Dense Motion CRF 

We use a standard dense CRF for formulating the semantic motion segmentation. The problem is posed as finding a minimal cost labelling of a CRF over a set of random variables $Y = \{Y_1, Y_2, \ldots, Y_N\}$, each of which can take the label moving or stationary, i.e. $\mathcal{M} = \{m_1, m_2\}$, where $m_1$ represents the stationary pixels in the image and $m_2$ the moving pixels. The formulation for motion is as follows:

$$E^{M}(y) = \sum_{i \in \mathcal{V}} \psi_i^{M}(y_i) + \sum_{i \in \mathcal{V},\, j \in \mathcal{N}_i} \psi_{ij}^{M}(y_i, y_j) \qquad (4)$$

where the unary potential $\psi_i^{M}(y_i)$ is given by the motion likelihood of the pixel and is computed as the difference between the predicted flow and the measured optical flow. The predicted flow is given by:

$$X' = K R K^{-1} X + K T / z \qquad (5)$$

where $K$ is the intrinsic camera matrix, $R$ and $T$ are the rotation and translation of the camera respectively, and $z$ is the depth of the pixel from the camera. $X$ is the location of the pixel in image coordinates and $X'$ is the predicted flow vector of the pixel given the motion of the camera. Thus the unary potential is given as:

$$\psi_i^{M}(y_i) = (\hat{X}' - X')^{T}\, \Sigma^{-1}\, (\hat{X}' - X') \qquad (6)$$

where $\Sigma$ is the covariance matrix, given by the sum of the covariance of the predicted flow and the covariance of the measured optical flow, and $\hat{X}' - X'$ is the difference between the measured optical flow $\hat{X}'$ and the predicted flow $X'$. The pairwise potential $\psi_{ij}^{M}(y_i, y_j)$ captures the relationship between neighbouring pixels and encourages adjacent pixels in the image to have similar flow. The cost function is defined as:

where $g(i,j)$ is an edge feature based on the difference between the flow of the neighbouring pixels:

$$g(i,j) = |f(y_i) - f(y_j)| \qquad (8)$$

where $f(\cdot)$ is the function which returns the flow vector of the corresponding pixel.
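The motion terms of this section can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the intrinsic matrix is passed as the tuple (fx, fy, cx, cy) of a zero-skew pinhole camera, the result of Eq. (5) is treated as a homogeneous coordinate and normalised, and all function names are hypothetical.

```python
import math

def predicted_position(X, z, K, R, T):
    """Predict where pixel X = (u, v) at depth z lands after camera motion (R, T).

    Evaluates X' ~ K R K^{-1} X + K T / z in homogeneous coordinates.
    K = (fx, fy, cx, cy); R is a 3x3 nested list; T is a 3-vector.
    """
    fx, fy, cx, cy = K
    u, v = X
    # K^{-1} X: back-project the pixel to a normalised viewing ray.
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the ray and add the depth-scaled translation.
    moved = [sum(R[r][c] * ray[c] for c in range(3)) + T[r] / z for r in range(3)]
    # Apply K and normalise the homogeneous coordinate.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)

def motion_cost(predicted, observed, cov):
    """Mahalanobis distance d^T cov^{-1} d between observed and predicted positions."""
    dx = observed[0] - predicted[0]
    dy = observed[1] - predicted[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of the 2x2 covariance, applied to the residual.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det

def flow_edge_feature(flow_i, flow_j):
    """g(i, j): magnitude of the flow difference between two pixels."""
    return math.hypot(flow_i[0] - flow_j[0], flow_i[1] - flow_j[1])
```

A static pixel observed where the camera motion predicts it yields a cost near zero, while an independently moving pixel deviates from the prediction and receives a large cost scaled by the flow uncertainty.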


4. JOINT CRF FORMULATION 

In this section, we use the object class segmentation and the motion estimate to jointly estimate the labels of the dynamic scene. Each random variable $Z_i = [X_i, Y_i]$ takes a label $z_i = [x_i, y_i]$ from the product space of object class and motion labels, which corresponds to the variable $Z_i$ taking an object label $x_i$ and a motion label $y_i$. In general, the energy of the CRF for joint estimation is written as:

$$E^{J}(z) = \sum_{i \in \mathcal{V}} \psi_i^{J}(z_i) + \sum_{i \in \mathcal{V},\, j \in \mathcal{N}_i} \psi_{ij}^{J}(z_i, z_j) \qquad (9)$$

where $\psi_i^{J}$ and $\psi_{ij}^{J}$ are the sums of the previously mentioned terms $\psi_i^{O}$, $\psi_i^{M}$ and $\psi_{ij}^{O}$, $\psi_{ij}^{M}$ respectively. We include some extra terms which capture the relation between the labels of $X$ and $Y$. In real-world scenarios there is a relationship between the object class and the corresponding motion likelihood of each pixel, so we compute interactive unary and pairwise potential terms such that a joint inference can be performed.

4.1 Joint unary potential 

The unary potential $\psi_i^{J}(z_i)$ can be defined as an interactive potential term which incorporates a relationship between the object class and the corresponding motion likelihood. We could directly take the relationship between the object class and all the possible motion models as a measure to calculate the joint unary potential. As this requires a large amount of training data to cover all motion models for each class, we instead use a class and motion correlation function which incorporates the class-motion compatibility and can be expressed as:

$\psi_{i,l,m}^{OM}(x_i, y_i) =$ (10)

Here $\lambda(l, m) \in [-1, 1]$ is a learnt correlation term between the motion and object class labels. The combined unary potential of the joint CRF is given as follows:

$$\psi_{i,l,m}^{J}([x_i, y_i]) = \psi_i^{O}(x_i) + \psi_i^{M}(y_i) + \psi_{i,l,m}^{OM}(x_i, y_i) \qquad (11)$$

where $\psi_i^{O}$ and $\psi_i^{M}$ are the unary potentials previously discussed for the object class and motion likelihood of a pixel $i$ given the image.
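To make the effect of the class-motion term concrete, the following sketch builds the joint unary table of Eq. (11) for a single pixel. It assumes, purely for illustration, that the compatibility cost is the negated correlation $-\lambda(l, m)$, so that positively correlated class-motion pairs become cheaper; the function names and all numeric values are hypothetical.

```python
def joint_unary(psi_obj, psi_mot, lam):
    """Joint unary cost table for one pixel.

    psi_obj[l] : object-class unary cost for label l
    psi_mot[m] : motion unary cost for label m (0 = stationary, 1 = moving)
    lam[l][m]  : learnt class-motion correlation in [-1, 1]
    """
    return [[psi_obj[l] + psi_mot[m] - lam[l][m]   # assumed: compatibility cost = -lambda
             for m in range(len(psi_mot))]
            for l in range(len(psi_obj))]

def best_joint_label(cost):
    """Return the (object, motion) pair with minimal joint unary cost."""
    pairs = ((l, m) for l in range(len(cost)) for m in range(len(cost[l])))
    return min(pairs, key=lambda lm: cost[lm[0]][lm[1]])

# Labels: object 0 = car, 1 = wall; motion 0 = stationary, 1 = moving.
psi_obj = [1.0, 0.9]                   # appearance alone slightly prefers "wall"
psi_mot = [1.0, 0.4]                   # flow evidence prefers "moving"
lam = [[-0.2, 0.8], [0.6, -0.9]]       # made-up correlations: cars move, walls do not
label = best_joint_label(joint_unary(psi_obj, psi_mot, lam))
```

In this toy case the appearance cost alone would pick "wall", but the strong motion evidence combined with the correlation term makes "moving car" the cheapest joint label.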

4.2 Joint pairwise potential 

The joint pairwise potential $\psi_{ij}^{J}(z_i, z_j)$ enforces the consistency of object class and motion between neighbouring pixels. This potential term exploits the observation that when there is a change in the motion layer, there is a high chance that the object class label changes. Similarly, if the label changes in the object class layer, the label in the motion layer is more likely to change. To include this behaviour in our formulation, we take the joint pairwise term as:

$$\psi_{ij}^{J}([x_i, y_i], [x_j, y_j]) = \psi_{ij}^{O}(x_i, x_j) + \psi_{ij}^{M}(y_i, y_j) \qquad (12)$$

Here $\psi_{ij}^{O}(x_i, x_j)$ and $\psi_{ij}^{M}(y_i, y_j)$ have been defined earlier as the pairwise terms of object class and motion respectively.
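The additive structure of Eq. (12) can be written down directly. In this small sketch each layer uses a Potts cost, and the two weights stand in for the kernel values of the object and motion layers for a given pixel pair; the names and weights are illustrative assumptions.

```python
def potts(a, b, weight):
    """Potts cost: zero for equal labels, `weight` otherwise."""
    return 0.0 if a == b else weight

def joint_pairwise(z_i, z_j, w_obj, w_mot):
    """Sum of the object-class and motion Potts terms for joint labels z = (x, y)."""
    (x_i, y_i), (x_j, y_j) = z_i, z_j
    return potts(x_i, x_j, w_obj) + potts(y_i, y_j, w_mot)

# Two neighbouring pixels that agree on the class but not on the motion label.
cost = joint_pairwise(("car", 1), ("car", 0), w_obj=0.5, w_mot=0.25)
```

Because the term is a sum over the two layers, a pair of pixels is penalised once for each layer in which its labels disagree.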

5. INFERENCE 

Inference has been a challenging problem for large-scale CRFs. We follow Krähenbühl et al. [11], who use a mean-field approximation approach for inference. The mean-field approximation introduces an alternative distribution over the random variables of the CRF, $Q_i(z_i)$, where the marginals are forced to be independent, $Q(z) = \prod_i Q_i(z_i)$. The mean-field approximation then attempts to minimize the KL-divergence between $Q$ and the true distribution $P$. We can therefore take $Q_i(z_i) = Q_i^{O}(x_i)\, Q_i^{M}(y_i)$. Here $Q_i^{O}$ is a multi-class distribution over the object labels, and $Q_i^{M}$ is a binary distribution over moving or stationary, given by $\{0, 1\}$.

$$Q_i^{O}(x_i = l) = \frac{1}{Z_i} \exp\Big\{ -\psi_i^{O}(x_i) - \sum_{l' \in \mathcal{L}} \sum_{j \neq i} Q_j^{O}(x_j = l')\, \psi_{ij}^{O}(x_i, x_j) - \sum_{m' \in \mathcal{M}} Q_i^{M}(y_i = m')\, \psi_{i,l,m}^{OM}(x_i, y_i) \Big\}$$

The inference for the motion layer is similar to that for the object class layer and is given by:

$$Q_i^{M}(y_i = m) = \frac{1}{Z_i} \exp\Big\{ -\psi_i^{M}(y_i) - \sum_{m' \in \mathcal{M}} \sum_{j \neq i} Q_j^{M}(y_j = m')\, \psi_{ij}^{M}(y_i, y_j) - \sum_{l' \in \mathcal{L}} Q_i^{O}(x_i = l')\, \psi_{i,l,m}^{OM}(x_i, y_i) \Big\}$$

where $Z_i$ is the normalization factor and $m \in \{0, 1\}$. As proposed previously, using $n + m$ Gaussian convolutions we can efficiently evaluate the pairwise summations, which are given as Potts models.
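The two coupled updates can be illustrated on a toy problem. The sketch below is a naive version that sums over all pixel pairs explicitly instead of using the Gaussian filtering mentioned above, so it costs O(N^2) per iteration. It further assumes that both layers share one Potts kernel and that the class-motion cost is $-\lambda(l, m)$; the function names and all numbers are made up for illustration.

```python
import math

def softmax_neg(energies):
    """Return Q(l) = exp(-E_l) / Z for a list of energies."""
    lowest = min(energies)                      # shift for numerical stability
    weights = [math.exp(lowest - e) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def potts_message(q, kernel_row, i, label):
    """Expected Potts cost at pixel i: kernel mass on pixels disagreeing with `label`."""
    return sum(k * (1.0 - q[j][label]) for j, k in enumerate(kernel_row) if j != i)

def mean_field(psi_obj, psi_mot, lam, kernel, iters=10):
    """Naive mean-field updates for the factorised Q_i^O(x_i) Q_i^M(y_i)."""
    n = len(psi_obj)
    n_obj, n_mot = len(lam), len(lam[0])
    q_obj = [softmax_neg(u) for u in psi_obj]   # initialise from the unaries
    q_mot = [softmax_neg(u) for u in psi_mot]
    for _ in range(iters):
        new_obj, new_mot = [], []
        for i in range(n):
            e_obj = [psi_obj[i][l]
                     + potts_message(q_obj, kernel[i], i, l)
                     - sum(q_mot[i][m] * lam[l][m] for m in range(n_mot))
                     for l in range(n_obj)]
            e_mot = [psi_mot[i][m]
                     + potts_message(q_mot, kernel[i], i, m)
                     - sum(q_obj[i][l] * lam[l][m] for l in range(n_obj))
                     for m in range(n_mot)]
            new_obj.append(softmax_neg(e_obj))
            new_mot.append(softmax_neg(e_mot))
        q_obj, q_mot = new_obj, new_mot         # parallel (synchronous) update
    return q_obj, q_mot

# Three pixels; object labels (car, wall), motion labels (stationary, moving).
psi_obj = [[0.2, 2.0], [1.0, 0.9], [0.2, 2.0]]  # middle pixel slightly prefers "wall"
psi_mot = [[2.0, 0.2], [2.0, 0.2], [2.0, 0.2]]  # every pixel looks like it is moving
lam = [[0.0, 0.8], [0.5, -0.8]]                 # lam[l][m], made-up correlations
kernel = [[0.0, 0.5, 0.1], [0.5, 0.0, 0.5], [0.1, 0.5, 0.0]]
q_obj, q_mot = mean_field(psi_obj, psi_mot, lam, kernel)
```

In this example the middle pixel, whose appearance cost marginally favours "wall", ends up with most of its probability mass on "car" because its neighbours are cars and its motion marginal favours "moving".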

6. LEARNING

In this section we learn the parameters relating the object labels and motion. We describe a piecewise method for training the label and motion correlation matrices. In the model described, we train for the matrix simultaneously by learning an $(n + 2)^2$ correlation matrix.

We use the modified AdaBoost framework implemented in [27]. For training, we denote the training dataset of $N$ instances of pixels or regions as $\mathcal{D} = \{(t_1, z_1), (t_2, z_2), \ldots, (t_N, z_N)\}$. Here, $t_i$ is a feature vector for the $i$-th instance and $z_i = [x_i, y_i]$ is an indicator vector of length $n + 2$, where $x_i(l) = 1$ implies that the class label $l$ is associated with the pixel or region instance $i$ and $x_i(l) = -1$ that the class is not associated with the instance $i$; similarly, $y_i(m) = 1$ and $y_i(m) = -1$ represent the association of motion $m$ with the instance $i$. Therefore, $z_i$ represents the object class and motion ground truth information for the instance $i$.

In the following, we show how to compute $\lambda(l, m)$. The boosting approach generates a strong classifier $H_{s,l}(t)$ for each object class $l$ and each round of boosting $s = 1, 2, 3, \ldots, S$. These strong classifiers can be defined as:

$$H_{s,l}(t) = \sum_{s = 1, 2, \ldots, S} \alpha_{s,l}\, h_{s,l}(t)$$

Here $h_{s,l}$ are weak classifiers and $\alpha_{s,l}$ are the non-negative weights set by the boosting algorithm. We use a previously proposed joint learning approach, which generates a sequence of reuse weights for each class and motion attribute $l, m$ at each iteration $s$. These represent the weight given to the strong classifier for motion label $m$ in round $s - 1$ in the classifier for $l$ at round $s$. Using these reuse weights and the strong classifiers, we can calculate the label correlation:

A(/, ?Tl) = ^ ^ (77s_i^77t)) 77s —l,m)) 

s=2,...S 

This learning approach incorporates information about the motion likelihood and the appearance relationship between motion and objects.

7. EXPERIMENTAL EVALUATION 

We have used KITTI, a popular street-level dataset, for evaluation. It consists of several sequences collected by a car-mounted camera driving in urban, residential and highway environments, making it a varied and challenging real-world dataset. We have taken 6 sequences from the KITTI tracking dataset, each containing 20 stereo images of size 1024 x 365. First, these sequences were manually annotated with 11 object classes covering the spectrum of classes. Second, each image was annotated with moving and non-moving objects using the tracking ground truth data. These sequences are challenging as they contain multiple moving cars; the labels consist of 11 classes, i.e. Pavement, Car, Signal, Sky, Poles, Pedestrian, Fence, Building, Vegetation and Road. We have selected the KITTI dataset as it contains stereo image pairs with a wide baseline. We learned the motion compatibility, as a simple lookup table would not work due to instances where the semantic prior is wrong. The hard negatives allow us to categorically remove objects with a wrong motion likelihood, a common occurrence due to inconsistent disparity. This would allow us to test on a variety of datasets without needing to train for similar classes. The dataset was also chosen to showcase the algorithm's capability on degenerate cases which are not commonly addressed in other datasets.

We have used the semi-global block matching algorithm to compute the disparity map in the stereo camera sequence. To compute the motion of the moving camera, we have used a RANSAC-based algorithm to solve Eq. (5). We have added the temporal consistency of motion across 3 images to the likelihood estimate, which improves the results. For the dense optical flow computation, we have used the DeepFlow algorithm, which has given state-of-the-art results on the KITTI evaluation benchmark. For object class segmentation we have used the publicly available TextonBoost classifier to compute the unary potentials for each class.

Figure 2: Qualitative object class and motion results for the KITTI dataset. 1) Images of three sequences of the KITTI dataset (INPUT). 2) Ground truth of object class segmentation (GT-O). 3) Object class segmentation results using the fully connected CRF (FULL-C). 4) Object class segmentation using the joint formulation of the proposed method (OURS-O). 5) Ground truth of the motion segmentation (GT-M). 6) Motion segmentation using geometric constraints (GEO-M) [18]. 7) Dense motion segmentation of the proposed method (OURS-M). For motion segmentation, blue depicts stationary and red depicts moving pixels. Best viewed in color.

Qualitative evaluation: We show our results in comparison to the ground truth, to FULL-C for semantic segmentation and to GEO-M for motion, in Fig. 2. In Sequence 1, FULL-C is not able to segment cars as a whole and misses several patches, while GEO-M fails due to degeneracy in motion. Sequence 3 has patches on the front car due to the failure of the disparity computation; also, the car's window is wrongly classified by ALE. These errors are corrected by our method. The motion consistency helps in removing the window patch, while in Sequence 1 the degeneracy is handled by our motion estimator, which is independent of such geometric constraints. The joint formulation captures the motion and improves the semantics of the image. We also show an improvement in the segmentation where the disparity computation has failed, as shown in Sequence 1. We take a specific example showing pedestrians in Fig. 3: the image is smoothed out by the fully connected dense CRF, leading to a wrong labelling, while our method is able to correctly segment the whole pedestrian. This again underlines the value of motion correlation for a better labelling. In all the above cases, we can see the effectiveness of our algorithm in handling the motion to generate a dynamic semantic model of the scene. We tried using motion cues as a feature in the object class unary. This could not be used as a discriminative feature for an object class, as objects can have different motions which cannot be learned through TextonBoost. We have implemented the fully connected CRF module as it showed substantial improvement in results for specific classes like pedestrian compared to a superpixel clique-based CRF model.

Figure 3: Comparative evaluation between the results of Full-C and our method. The original image (1) is taken from the KITTI dataset. The output of Full-C (2) shows a wrong labelling of the pedestrian pixels in the image. The results of the proposed method (3) show the improvement in the semantic segmentation.

Quantitative evaluation: We quantitatively compare our approach against other state-of-the-art image segmentation approaches, including a pairwise CRF semantic segmentation approach with superpixel-based higher orders (AHCRF), the fully connected CRF (Full-C) and a joint motion-object CRF with superpixel-clique consistency (JAHCRF). The quantitative evaluation of the object class segmentation of our joint optimization method with respect to the other approaches is summarized in the corresponding results table. Evaluation is performed by cross-verifying each classified pixel with the ground truth. We choose the average intersection/union as the evaluation measure for both the image segmentation and the motion segmentation. It is defined as TP/(TP + FP + FN), where TP is the number of true positives, FP the number of false positives and FN the number of false negatives. We observe an increase in performance for most of the classes; in particular, the object classes car and person show substantial improvements in accuracy. This is attributed to the fact that motion can be associated with specific classes and that the pairwise connections in the motion domain respect the continuity in optical flow, while in the image domain the connections between neighbouring pixels might violate the occlusion boundaries.
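The intersection/union measure defined above can be computed per class from a predicted and a ground-truth label map. The following is a minimal sketch; the function names and the flat-list input format are assumptions made for the example, and classes that occur in neither map are skipped when averaging.

```python
def per_class_iou(pred, gt, num_classes):
    """Intersection over union TP / (TP + FP + FN) for each class.

    pred and gt are flat sequences of integer labels of equal length.
    Classes absent from both prediction and ground truth yield None.
    """
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, g in zip(pred, gt):
        if p == g:
            tp[p] += 1
        else:
            fp[p] += 1      # predicted class p where it should not be
            fn[g] += 1      # missed a pixel of the true class g
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        ious.append(tp[c] / denom if denom else None)
    return ious

def mean_iou(ious):
    """Average the per-class scores, ignoring classes that never occur."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)

ious = per_class_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)
```

The same function applies to the motion layer by using two classes, stationary and moving.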

The evaluation of motion segmentation is summarized in Table 1. We compute our motion accuracy in the same way as for the object class segmentation. We compare our results with geometric-based motion segmentation (GEO-M) [18] and with joint labelling of motion and the superpixel-based image segmentation results (AHCRF+Motion). We observe an increase in the accuracy of the motion segmentation. This increase is due to the fact that we have incorporated the label and motion correlation: the fact that a moving wall or a moving tree is less likely than a moving car or a moving person has been exploited. The substantial improvement of the motion segmentation results over AHCRF+Motion can be attributed to the robust pairwise potentials of the dense formulation.

Method          Moving   Stationary
GEO-M [18]      46.5     49.8
AHCRF+Motion    60.2     75.8
OURS-M          73.5     82.4

Table 1: Quantitative analysis of motion segmentation for the KITTI dataset. We have compared our results (OURS-M) with geometric-based motion segmentation (GEO-M) and joint optimization with superpixel-based clique and motion estimate (AHCRF+Motion).


8. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a joint approach to simultaneously predict the motion and object class labels for pixels and regions in a given image. The experiments suggest that combining information from motion and objects at region and pixel levels helps semantic image segmentation. Further evaluations also show that per-pixel motion segmentation is important in achieving higher accuracy in the motion segmentation results. In order to encourage future work and new algorithms in the area, we are going to make the motion segmentation dataset of the KITTI tracking dataset available.

In future work, we intend to extend the method by segmenting objects with different motions and segmenting each object as a different class. We also plan to develop a GPU implementation of the proposed algorithm and to generalize the current approach for dynamic scene understanding. We will continue expanding the annotations and the data in the KITTI tracking dataset.